This report explores Prosper Loan dataset that contain loan information for approximately 114,000 loan records.
First of all, in this section, I would like to perform some preliminary exploration of the dataset.
Original Prosper loan dataset contains approximately 114,000 observations and 81 variables. I have created a subset from original data so that I can only focus on already closed loans. Also, I have picked up only necessary variables. Thus, the dataset in this report includes 55,084 observations with 16 variables as follows.
## [1] 55084 16
Structure of the dataset is listed as follows:
## 'data.frame': 55084 obs. of 16 variables:
## $ LoanOriginalAmount : int 9425 3001 1000 4000 10000 3000 2000 4000 4000 10000 ...
## $ BorrowerRate : num 0.158 0.275 0.133 0.318 0.125 ...
## $ StatedMonthlyIncome : num 3083 2083 1667 5500 5833 ...
## $ DebtToIncomeRatio : num 0.17 0.06 0.27 0.49 0.12 0.09 0.39 0.11 0.26 0.11 ...
## $ MonthlyLoanPayment : num 330.4 123.3 33.8 173.7 334.5 ...
## $ ProsperScore : num NA NA NA 5 NA NA 5 3 9 9 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 7 2 5 7 4 4 4 6 4 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 3 3 5 2 3 5 3 3 3 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 4 3 6 4 3 2 2 3 2 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 1 1 1 2 1 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 1 1 8 1 1 6 7 2 2 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 260 451 1399 160 488 1655 1609 941 1348 ...
## $ ListingCategory..numeric.: int 0 0 0 13 0 0 15 1 1 20 ...
## $ Term : int 36 36 36 36 36 36 36 60 36 36 ...
## $ ProsperRating..numeric. : int NA NA NA 1 NA NA 3 2 6 6 ...
## $ IsDefaulted : chr "Completed" "Completed" "Completed" "Defaulted/Charge-off" ...
Summary of each variables are as follows:
## LoanOriginalAmount BorrowerRate StatedMonthlyIncome DebtToIncomeRatio
## Min. : 1000 Min. :0.0000 Min. : 0 Min. : 0.00
## 1st Qu.: 2600 1st Qu.:0.1350 1st Qu.: 2811 1st Qu.: 0.13
## Median : 4500 Median :0.1945 Median : 4167 Median : 0.20
## Mean : 6262 Mean :0.2004 Mean : 5055 Mean : 0.29
## 3rd Qu.: 8000 3rd Qu.:0.2699 3rd Qu.: 6250 3rd Qu.: 0.30
## Max. :35000 Max. :0.4975 Max. :618548 Max. :10.01
## NA's :4230
## MonthlyLoanPayment ProsperScore IncomeRange
## Min. : 0.00 Min. : 1.000 $25,000-49,999:16343
## 1st Qu.: 98.29 1st Qu.: 5.000 $50,000-74,999:12789
## Median : 172.60 Median : 6.000 Not displayed : 7737
## Mean : 223.73 Mean : 6.266 $75,000-99,999: 6442
## 3rd Qu.: 300.47 3rd Qu.: 8.000 $100,000+ : 6064
## Max. :2251.51 Max. :11.000 $1-24,999 : 4571
## NA's :29079 (Other) : 1138
## LoanStatus EmploymentStatus IsBorrowerHomeowner
## Completed :38074 Full-time :24957 False:29199
## Chargedoff :11992 Employed :16491 True :25885
## Defaulted : 5018 Not available: 5346
## Cancelled : 0 Self-employed: 2926
## Current : 0 : 2252
## FinalPaymentInProgress: 0 Part-time : 1056
## (Other) : 0 (Other) : 2056
## ProsperRating..Alpha. LoanOriginationDate
## :29079 2007-01-17 00:00:00: 137
## D : 5869 2011-08-31 00:00:00: 133
## E : 3830 2008-04-30 00:00:00: 128
## C : 3817 2008-05-28 00:00:00: 126
## HR : 3725 2011-09-30 00:00:00: 121
## A : 3608 2011-11-30 00:00:00: 120
## (Other): 5156 (Other) :54319
## ListingCategory..numeric. Term ProsperRating..numeric.
## Min. : 0.000 Min. :12.00 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:36.00 1st Qu.:2.000
## Median : 1.000 Median :36.00 Median :3.000
## Mean : 2.408 Mean :36.94 Mean :3.663
## 3rd Qu.: 3.000 3rd Qu.:36.00 3rd Qu.:5.000
## Max. :20.000 Max. :60.00 Max. :7.000
## NA's :29079
## IsDefaulted
## Length:55084
## Class :character
## Mode :character
##
##
##
##
Now, I would like to look into the dataset. Firstly, I would like to explore some of the univariate plots.
First histogram is showing right skewed distribution. I have transformed the long tail data to better understand the distriution. The transformed loan amount distribution appears to have unimodal with peaks between 5,000 - 7,000.
I would like to look at other numerical variables as well.
Histogram by Borrwer rate is showing slightly right skewed unimodal distribution with its peak around 0.15.
Histogram by loan term only shows 12, 36 and 60 months. Why are they so cloear cut? According to Prosper website[1], they seem to be offering only fixed term with either 3 or 5 years.
[1] https://www.prosper.com/ Prosper website
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2811 4167 5055 6250 618548
Next, I would like to look across categorical variables.
Firstly let’s look at the distribution by Income Range.
I was expecting to see more right skewed distribution, in ohter words, tendecy like borrower with lower income tend to borrow more. However, the plot shows unimodal distribution with peak around $25,000 and $50,000.
## [1] 0.8247404
This plot shows distribution by Employment status. 82.4 % of the borrowers are employed in some form, i.e. either ‘Employed’,‘Full-time’, ‘Part-time’, or ‘Self-employed’.
## [1] 0.4699187
Above plot showing distribution by whether or not the borrower is home owner. 46.9% of the borrowers are home owners.
Next, let’s have alook at distribution by risk indicators.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 5.000 6.000 6.266 8.000 11.000 29079
Above plot by Prosper score is showing slightly left skewed unimodal distribution with its peak around 8 with median of 6 and most of them are between 5 and 8.
This plot shows distribution by Prosper rating. Borrowers seem to be distributed almost equally across each ratings, except ‘D’ has the prominently high density.
Also listed the summary for each ratings as below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2600 4500 6262 8000 35000
## loandata_completed$ProsperRating..Alpha.:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4500 6160 7920 25000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3500 6000 8041 11000 35000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 6500 8178 12000 25000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 8559 12500 35000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 6000 7547 12000 25000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 5000 6033 8000 15000
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 4000 4300 5500 15900
## --------------------------------------------------------
## loandata_completed$ProsperRating..Alpha.: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4000 3397 4000 16800
Finally, I would like to see what are the proportion of defaulted/charged-off loans.
The first plot shows the proportion of completed loans, defaulted loans, and charged-off loans. In the second plot, I just combined defaulted loans and charge-off loans togerther, to compare with the completed loans.
## [1] 0.6911989
Almost 70 % of closed loans are completed without being defaulted/charge-off.
Original Prosperloan dataset consists of 81 variables, with approximately 114,000 observations. I have created a subset of 55,084 observations that have closed status and also only use 16 variables.
The main features of interest are Categorical factor variables listed as below:
Other variables such as BorrowerRate, StatedMonthlyIncome, DebtToIncomeRatio, MonthlyLoanPayment, and Term, likely contribute to further analysis.
Firstly I have plotted a correlation matrix as below.
Now, I would like to look into the relation among individual variables more closely.
Firstly, I would like to see the relation between Prosper Score and Monthly Income, and also Prosper rating and monthly income, that are showing the strongest correlation in above chart.
Next, I would like to see how borrowers rate differs across Prosper rating and Prosper score.
Next, I would like to see the correlation between Borrower rate and Monthly Income that also showing quite strong correlation of -0.0691 in the first correlation matrix.
The lm line showing the negative trend between Stated Monthly Income and Borrower rate. The hihger the monthly income, the lower the rate.
Next, I would like to see how the original loan amount varies by Prosper ratings.
It becomes more obvious when comparing the total in number and in amount for each ratings.
Next, I would like to explore what factors could affect whether or not a loan is to be defaulted?
Firstly, I would like to look this by loan amount.
Next, I would like to look into the distribution by monthly payment amount.
## loandata_completed$IsDefaulted: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 94.15 171.10 218.78 297.00 2251.51
## --------------------------------------------------------
## loandata_completed$IsDefaulted: Defaulted/Charge-off
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 107.4 173.7 234.8 307.2 1552.8
What can we see if we look this from the borrower rate?
Above plot compares the distribution by borrower rate.
As we have seen earlier, the borrower rate seems to be determined based on risk factors, such as Prosper rating and Prosper scores.
Let’s have a look at the distribution by rating and score to see how it differs between completed loans and defaullted loans.
## loandata_completed$IsDefaulted: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 5.000 7.000 6.537 8.000 11.000 18410
## --------------------------------------------------------
## loandata_completed$IsDefaulted: Defaulted/Charge-off
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 5.000 5.427 7.000 11.000 10669
** The proportion of AA and A seem to be much smaller for Defaulted/Charge-off loans than Completed loans.
** Lower ratings such as E and HR have larger proportion with defaulted loans.
** Defaulted/Charge-off loans has its peak around 5-6 whereas Completed loans has around 8.
** Summary shows 3rd quatile of Defaulted/charge-off loans is 7, whereas that of Completed loans is 8.
How did the feature(s) of interest vary with other features in the dataset?
I would like to look in the separated plots as below to see how the loan amount differs by Prosper rating.
Seemingly, borrowers with higher rating tend to borrow more amount with lower rate than lower graded borrowers.
Do the borrower with better rating or score repay more?
## loandata_completed$IsDefaulted: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2917 4417 5324 6583 618548
## --------------------------------------------------------
## loandata_completed$IsDefaulted: Defaulted/Charge-off
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2500 3750 4451 5417 208333
Finally I would like to see how the defaulted loans are distributed across Prosper rating and Prosper score on below scatter plot.
It seems Prosper rating and Prosper score have quite strong relation though it is difficult to say if this relation is strengthening each other. However, these 2 variables seems to be correaltes also with borrower rate.
It is not surprising to see there is strong correlation between risk indicators and borrower rate.
N/A
For defaulted loans, the borrower rate range is distributed more in higher rates, comparing to the completed loans. This may suggest that the higher the loan rate, the more difficult for borrowers to complete the loan.
Ratings and Scores seem to have quite strong correlation with borrower rate. These risk indicators semms to be used as a basis when determining borrower rate.
So, are these risk indicator effective to reduce the risk of default?
Seemingly, defaulted/charge-off loans are distributed more in lower ratings such as D, E, and HR. However, defaulted loans still can be seen even with higher ratings.